Load Balance and Communication Tradeoffs in Parallel Matrix Factorization
Abstract
In block-partitioned parallel matrix factorization algorithms, where the matrix is distributed over a logical torus processor grid with an r × s block-cyclic matrix distribution, the greatest scope for optimization lies in the formation of (block) panels. Let ω be the panel width, with ω_m being an optimal value based on the characteristics of a single processor's memory hierarchy. To date, two well-known techniques exist: storage blocking, where ω_m ≈ ω = r = s, and algorithmic blocking (also known as 'distributed panels'), where ω ≈ ω_m and r = s = 1. These represent strategies at opposite ends of a load-balance versus communication-cost tradeoff. In this paper, we present two new techniques for panel formation, called pipelining with lookahead and panel scattering. The former requires communication to be uni-directional across a processor dimension, and thus can normally only be applied to the column panel. It can be characterized by ω_m ≈ ω = s, with communication and computation overlapped across processor columns, at the cost of some pipeline startup time. The latter uses ω_m ≈ ω = r = s, but involves scattering the panel along its longest dimension across all processors, to be collected and broadcast when the panel formation is complete. While it achieves perfect load balance, this method can double the communication volume. Implementation issues for these methods are discussed. For a given target architecture, the optimum method (or combination of methods) depends on the communication-to-computation performance ratio of that architecture.
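To make the block-cyclic layout and the load-balance end of the tradeoff concrete, here is a minimal Python sketch. It is not from the paper: the function names, the unit of work (one block), and the P × Q grid notation are my own. It maps blocks to grid positions cyclically and tallies how much of a column panel each process column holds, contrasting a panel confined to one distribution block column (as in storage blocking) with a panel spread over all Q process columns (as in algorithmic blocking or scattering).

```python
def block_owner(I, J, P, Q):
    """Grid coordinates (p, q) of the process owning block row I,
    block column J under a 2-D block-cyclic distribution on a P x Q grid."""
    return (I % P, J % Q)

def panel_loads(n_block_rows, j0, w, Q):
    """Blocks held by each of the Q process columns for a column panel
    spanning block columns j0 .. j0 + w - 1 (all block rows)."""
    loads = [0] * Q
    for J in range(j0, j0 + w):
        loads[J % Q] += n_block_rows
    return loads

# Panel confined to one block column: a single process column holds the
# whole panel, so the other Q - 1 process columns idle during its formation.
print(panel_loads(8, 0, 1, 4))   # [8, 0, 0, 0]

# Panel spread over w = Q block columns: panel work is perfectly balanced.
print(panel_loads(8, 0, 4, 4))   # [8, 8, 8, 8]
```

The imbalance factor of Q in the first case is exactly what the lookahead and scattering techniques above trade communication to avoid.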
Similar articles
A Unified Algorithm for Load-balancing Adaptive Scientific Simulations
Adaptive scientific simulations require that periodic repartitioning occur dynamically throughout the course of the simulation. The computed repartitionings should minimize both the inter-processor communication incurred during the iterative mesh-based computation and the data-redistribution costs required to balance the load. Recently developed schemes for computing repartitionings provide the...
How Architecture Evolution Influences the Scheduling Discipline used in Shared-Memory Multiprocessors
Parallel applications execute efficiently only when they distribute their workload among the available processors, so that no processors are idle while there is work to do, and the interactions among the processors, in the form of communication or synchronization overhead, are minimized. Communication is every form of information exchange, including message passing, cache misses, and non-local memory acce...
Load-balance in parallel FACR
Fourier Analysis Cyclic Reduction (FACR) is a class of very efficient methods for the solution of Poisson's equation on regular grids. We show that exploiting the numerical properties of the tridiagonal systems involved may reduce the factorization work required to a few percent of a normal factorization. We also show that exploiting this property on distributed-memory parallel processor architectures m...
On the Competitive Analysis of Randomized Static Load Balancing
Static load balancing is attractive due to its simplicity and low communication costs. We analyze under which circumstances a randomized static load balancer can achieve good balance if the subproblem sizes are unknown and chosen by an adversary. It turns out that this worst-case scenario is quite close to a more specialized model for applications related to parallel backtrack search. In both ...
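As a rough illustration of the model this snippet describes (my own sketch, not the paper's analysis; the function name and the imbalance metric are assumptions), the code below assigns subproblems of adversary-chosen sizes to processors uniformly at random and reports the maximum load relative to a perfect split:

```python
import random

def imbalance(sizes, n_procs, seed=0):
    """Assign each subproblem to a processor uniformly at random and
    return max processor load divided by the perfectly balanced load."""
    rng = random.Random(seed)
    loads = [0.0] * n_procs
    for s in sizes:
        loads[rng.randrange(n_procs)] += s
    return max(loads) / (sum(sizes) / n_procs)

# Many small, equal subproblems: random placement lands close to balanced.
print(imbalance([1.0] * 10_000, 16))

# Few subproblems relative to processors (an adversarial regime): one
# processor necessarily carries a whole subproblem, so imbalance is severe.
print(imbalance([5.0], 4))
```

The contrast between the two runs is the crux of the competitive analysis: the quality of the random assignment depends on how the adversary splits the total work.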
On Runtime Parallel Scheduling
Parallel scheduling is a new approach to load balancing. In parallel scheduling, all processors cooperate to schedule work. Parallel scheduling is able to accurately balance the load by using global load information at compile time or runtime, providing high-quality load balancing. This paper presents an overview of the parallel scheduling technique. Particular scheduling algori...